Load the necessary libraries
library(car) # for regression diagnostics
library(broom) # for tidy output
library(ggfortify) # for model diagnostics
library(sjPlot) # for outputs
library(knitr) # for kable
library(effects) # for partial effects plots
library(emmeans) # for estimating marginal means
library(ggeffects) # for plotting marginal means
library(MASS) # for glm.nb
library(MuMIn) # for AICc
library(tidyverse) # for data wrangling
library(modelr) # for auxillary modelling functions
library(DHARMa) # for residual diagnostics plots
library(performance) # for residuals diagnostics
library(see) # for plotting residuals
library(patchwork) # grid of plots
library(scales) # for more scales
An ecologist studying a rocky shore at Phillip Island, in southeastern Australia, was interested in how clumps of intertidal mussels are maintained (Quinn 1988). In particular, he wanted to know how densities of adult mussels affected recruitment of young individuals from the plankton. As with most marine invertebrates, recruitment is highly patchy in time, so he expected to find seasonal variation, and the interaction between season and density - whether effects of adult mussel density vary across seasons - was the aspect of most interest.
The data were collected from four seasons, and with two densities of adult mussels. The experiment consisted of clumps of adult mussels attached to the rocks. These clumps were then brought back to the laboratory, and the number of baby mussels recorded. There were 3-6 replicate clumps for each density and season combination.
Format of quinn.csv data files
| SEASON | DENSITY | RECRUITS | SQRTRECRUITS | GROUP |
|---|---|---|---|---|
| Spring | Low | 15 | 3.87 | SpringLow |
| .. | .. | .. | .. | .. |
| Spring | High | 11 | 3.32 | SpringHigh |
| .. | .. | .. | .. | .. |
| Summer | Low | 21 | 4.58 | SummerLow |
| .. | .. | .. | .. | .. |
| Summer | High | 34 | 5.83 | SummerHigh |
| .. | .. | .. | .. | .. |
| Autumn | Low | 14 | 3.74 | AutumnLow |
| .. | .. | .. | .. | .. |
| SEASON | Categorical listing of Season in which mussel clumps were collected independent variable |
| DENSITY | Categorical listing of the density of mussels within mussel clump independent variable |
| RECRUITS | The number of mussel recruits response variable |
| SQRTRECRUITS | Square root transformation of RECRUITS - needed to meet the test assumptions |
| GROUPS | Categorical listing of Season/Density combinations - used for checking ANOVA assumptions |
Mussel
quinn <- read_csv("../public/data/quinn.csv", trim_ws = TRUE)
glimpse(quinn)
## Rows: 42
## Columns: 5
## $ SEASON <chr> "Spring", "Spring", "Spring", "Spring", "Spring", "Spring…
## $ DENSITY <chr> "Low", "Low", "Low", "Low", "Low", "High", "High", "High"…
## $ RECRUITS <dbl> 15, 10, 13, 13, 5, 11, 10, 15, 10, 13, 1, 21, 31, 21, 18,…
## $ SQRTRECRUITS <dbl> 3.872983, 3.162278, 3.605551, 3.605551, 2.236068, 3.31662…
## $ GROUP <chr> "SpringLow", "SpringLow", "SpringLow", "SpringLow", "Spri…
summary(quinn)
## SEASON DENSITY RECRUITS SQRTRECRUITS
## Length:42 Length:42 Min. : 0.00 Min. :0.000
## Class :character Class :character 1st Qu.: 9.25 1st Qu.:3.041
## Mode :character Mode :character Median :13.50 Median :3.674
## Mean :18.33 Mean :3.871
## 3rd Qu.:21.75 3rd Qu.:4.663
## Max. :69.00 Max. :8.307
## GROUP
## Length:42
## Class :character
## Mode :character
##
##
##
Since we intend to model both SEASON and DENSITY as categorical variables, we need to explicitly declare them as factors.
quinn <- quinn %>% mutate(
SEASON = factor(SEASON, levels = c("Spring", "Summer", "Autumn", "Winter")),
DENSITY = factor(DENSITY)
)
Model formula: \[ \begin{align} y_i &\sim{} \mathcal{Pois}(\lambda_i)\\ ln(\mu_i) &= \boldsymbol{\beta} \bf{X_i}\\[1em] \end{align} \]
where \(\boldsymbol{\beta}\) is a vector of effects parameters and \(\bf{X}\) is a model matrix representing the intercept and effects of season, density and their interaction on mussel recruitment.
ggplot(quinn, aes(y = RECRUITS, x = SEASON, fill = DENSITY)) +
geom_boxplot()
Conclusions:
Lets mimic the effect of using a log link, by using log scaled y-axis.
ggplot(quinn, aes(y = RECRUITS, x = SEASON, fill = DENSITY)) +
geom_boxplot() +
scale_y_log10()
Conclusions:
We actually have a couple of initial options:
log transform the response variable (RECRUITS) and fit against a Gaussian distribution. If we do so, we would need to amend the response for cases when the response is zero (as the log of 0 is not defined). We could simply add 1 to each count before log transformed.
fit against a Poisson distribution with a log link
I will include this, just for an illustration on how to fit such a model, we will not pursue it further.
quinn.glmG <- glm(log(RECRUITS + 1) ~ DENSITY * SEASON, data = quinn, family = gaussian)
quinn.glm <- glm(RECRUITS ~ DENSITY * SEASON,
data = quinn,
family = poisson(link = "log")
)
quinn.glm %>% autoplot(which = 1:6)
Conclusions:
quinn.glm %>% performance::check_model()
quinn.glm %>% performance::check_overdispersion()
## # Overdispersion test
##
## dispersion ratio = 3.309
## Pearson's Chi-Squared = 112.497
## p-value = < 0.001
quinn.glm %>% performance::check_zeroinflation()
## # Check for zero-inflation
##
## Observed zeros: 2
## Predicted zeros: 0
## Ratio: 0.00
Conclusions:
quinn.resid <- quinn.glm %>% simulateResiduals(plot = TRUE)
quinn.resid %>% testResiduals()
## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.19018, p-value = 0.09584
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 1.797, p-value < 2.2e-16
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa bootstrapped outlier test
##
## data: simulationOutput
## outliers at both margin(s) = 4, observations = 42, p-value < 2.2e-16
## alternative hypothesis: two.sided
## percent confidence interval:
## 0.00000000 0.04761905
## sample estimates:
## outlier frequency (expected: 0.00976190476190476 )
## 0.0952381
## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.19018, p-value = 0.09584
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 1.797, p-value < 2.2e-16
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa bootstrapped outlier test
##
## data: simulationOutput
## outliers at both margin(s) = 4, observations = 42, p-value < 2.2e-16
## alternative hypothesis: two.sided
## percent confidence interval:
## 0.00000000 0.04761905
## sample estimates:
## outlier frequency (expected: 0.00976190476190476 )
## 0.0952381
quinn.resid %>% testDispersion()
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 1.797, p-value < 2.2e-16
## alternative hypothesis: two.sided
quinn.resid %>% testZeroInflation()
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 7.6923, p-value = 0.04
## alternative hypothesis: two.sided
# testTemporalAutocorrelation(quinn.glm1)
Conclusions:
## goodness of fit
1 - pchisq(quinn.glm$deviance, df = quinn.glm$df.residual)
## [1] 2.457479e-12
## any evidence of overdispersion
quinn.glm$deviance / quinn.glm$df.residual
## [1] 3.677366
Conclusions:
quinn %>%
group_by(SEASON, DENSITY) %>%
summarise(
Zeros = sum(RECRUITS == 0),
Prop = Zeros / n(),
Mean = mean(RECRUITS)
)
x <- rpois(100000, lambda = 2.67)
tab.1 <- table(x == 0)
tab.1 / sum(tab.1)
##
## FALSE TRUE
## 0.93071 0.06929
## OR, over the entire data
## is this due to excessive zeros (zero-inflation)
tab <- table(quinn$RECRUITS == 0)
tab / sum(tab)
##
## FALSE TRUE
## 0.95238095 0.04761905
## 5% is not many.. so it cant be zero-inflated
## how many 0's would we expect from a poisson distribution with a mean similar to our mean
mean(quinn$RECRUITS)
## [1] 18.33333
x <- rpois(100000, lambda = mean(quinn$RECRUITS))
tab.1 <- table(x == 0)
tab.1 / sum(tab.1)
##
## FALSE
## 1
Conclusion:
In light of the discovery that the Poisson model was over-dispersed, we will explore a negative binomial model. Rather than assume that the variance is equal to the mean (dispersion of 1), the dispersion parameter is estimated. Negative binomial models are also able to account for some level of zero-inflation.
quinn.nb <- MASS::glm.nb(RECRUITS ~ DENSITY * SEASON, data = quinn)
quinn.nb %>% autoplot(which = 1:6)
quinn.nb %>% performance::check_model()
quinn.resid <- quinn.nb %>% simulateResiduals(plot = TRUE)
quinn.resid %>% testResiduals()
## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.11213, p-value = 0.626
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 0.89834, p-value = 0.624
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa bootstrapped outlier test
##
## data: simulationOutput
## outliers at both margin(s) = 1, observations = 42, p-value = 0.78
## alternative hypothesis: two.sided
## percent confidence interval:
## 0.00000000 0.04761905
## sample estimates:
## outlier frequency (expected: 0.0111904761904762 )
## 0.02380952
## $uniformity
##
## One-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.11213, p-value = 0.626
## alternative hypothesis: two-sided
##
##
## $dispersion
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 0.89834, p-value = 0.624
## alternative hypothesis: two.sided
##
##
## $outliers
##
## DHARMa bootstrapped outlier test
##
## data: simulationOutput
## outliers at both margin(s) = 1, observations = 42, p-value = 0.78
## alternative hypothesis: two.sided
## percent confidence interval:
## 0.00000000 0.04761905
## sample estimates:
## outlier frequency (expected: 0.0111904761904762 )
## 0.02380952
quinn.resid %>% testDispersion()
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 0.89834, p-value = 0.624
## alternative hypothesis: two.sided
quinn.resid %>% testZeroInflation()
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 6.1728, p-value = 0.104
## alternative hypothesis: two.sided
## goodness of fit
1 - pchisq(quinn.nb$deviance, df = quinn.nb$df.residual)
## [1] 0.01311451
## any evidence of overdispersion
quinn.nb$deviance / quinn.nb$df.residual
## [1] 1.614218
AICc(quinn.glm, quinn.nb)
Conclusions:
Conclusions:
quinn.nb %>% plot_model(type = "eff", terms = c("SEASON", "DENSITY"))
quinn.nb %>%
allEffects() %>%
plot(multiline = TRUE, ci.style = "bar")
quinn.nb %>%
allEffects() %>%
plot(multiline = TRUE, ci.style = "bar", type = "link")
quinn.nb %>%
ggpredict(c("SEASON", "DENSITY")) %>%
plot()
quinn.nb %>%
ggemmeans(~ SEASON * DENSITY) %>%
plot()
Lets start with the estimated coefficients on the link (log) scale
quinn.nb %>% summary()
##
## Call:
## MASS::glm.nb(formula = RECRUITS ~ DENSITY * SEASON, data = quinn,
## init.theta = 9.022467857, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8704 -0.8274 0.0000 0.4999 2.1866
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3026 0.1875 12.283 < 2e-16 ***
## DENSITYLow 0.1133 0.2742 0.413 0.67934
## SEASONSummer 1.5721 0.2389 6.581 4.69e-11 ***
## SEASONAutumn 0.6763 0.2492 2.714 0.00664 **
## SEASONWinter -0.5680 0.2881 -1.971 0.04870 *
## DENSITYLow:SEASONSummer -0.8970 0.3509 -2.556 0.01059 *
## DENSITYLow:SEASONAutumn -0.1881 0.3788 -0.496 0.61955
## DENSITYLow:SEASONWinter -0.8671 0.5338 -1.624 0.10432
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(9.0225) family taken to be 1)
##
## Null deviance: 183.269 on 41 degrees of freedom
## Residual deviance: 54.883 on 34 degrees of freedom
## AIC: 293.09
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 9.02
## Std. Err.: 3.69
##
## 2 x log-likelihood: -275.095
Conclusions:
quinn.nb %>% tidy(conf.int = TRUE)
Now if we exponentiate to put these estimates on the response scale:
quinn.nb %>% tidy(conf.int = TRUE, exponentiate = TRUE)
Conclusions:
If we compare the above to the over-dispersed Poisson, we see that the estimates are the same, but that the standard errors are underestimated and hence the confidence intervals are narrower (for over-dispersed model)
quinn.glm %>% tidy(conf.int = TRUE, exponentiate = TRUE)
In order to tease apart the significant interaction(s), it might be useful to explore the effect of Density separately within each Season.
quinn.nb %>%
emmeans(~ DENSITY | SEASON) %>%
pairs() %>%
summary(infer = TRUE)
Conclusions:
Or we could express these on a factor (fold/ratio) scale.
quinn.nb %>%
emmeans(~ DENSITY | SEASON, type = "response") %>%
pairs()
## SEASON = Spring:
## contrast ratio SE df z.ratio p.value
## High / Low 0.893 0.245 Inf -0.413 0.6793
##
## SEASON = Summer:
## contrast ratio SE df z.ratio p.value
## High / Low 2.189 0.480 Inf 3.577 0.0003
##
## SEASON = Autumn:
## contrast ratio SE df z.ratio p.value
## High / Low 1.078 0.282 Inf 0.286 0.7749
##
## SEASON = Winter:
## contrast ratio SE df z.ratio p.value
## High / Low 2.125 0.973 Inf 1.646 0.0999
##
## Tests are performed on the log scale
Or perhaps even an absolute difference scale.
quinn.nb %>%
emmeans(~ DENSITY | SEASON, type = "response") %>%
regrid() %>%
pairs()
## SEASON = Spring:
## contrast estimate SE df z.ratio p.value
## High - Low -1.20 2.92 Inf -0.411 0.6812
##
## SEASON = Summer:
## contrast estimate SE df z.ratio p.value
## High - Low 26.17 7.97 Inf 3.284 0.0010
##
## SEASON = Autumn:
## contrast estimate SE df z.ratio p.value
## High - Low 1.42 4.92 Inf 0.288 0.7734
##
## SEASON = Winter:
## contrast estimate SE df z.ratio p.value
## High - Low 3.00 1.64 Inf 1.829 0.0673
Conclusions:
quinn.nb %>%
emmeans(~ DENSITY | SEASON, type = "response") %>%
pairs() %>%
summary(infer = TRUE)
It might be useful to capture these pairwise contrasts so that we can graph them as a caterpillar plot.
In the following, I will present the effects on a log (based 2) axis. Such a scale presents doubling and halving (etc) equidistant from 1.
eff <- quinn.nb %>%
emmeans(~ DENSITY | SEASON, type = "response") %>%
pairs() %>%
summary(infer = TRUE) %>%
as.data.frame()
eff %>%
ggplot(aes(y = ratio, x = SEASON)) +
geom_pointrange(aes(ymin = asymp.LCL, ymax = asymp.UCL)) +
geom_text(aes(label = sprintf("p=%.2f", p.value)), nudge_x = 0.25) +
geom_hline(yintercept = 1, linetype = "dashed") +
scale_x_discrete(name = "") +
scale_y_continuous(
name = "Density effect (High vs Low)", trans = scales::log2_trans(),
breaks = scales::breaks_log(base = 2)
) +
coord_flip(ylim = c(0.25, 4)) +
theme_classic()
As these summarise only involve categorical predictors, there is no need to define a prediction grid. For categorical predictors, the default grid will assume that you are interested in all the levels of the categorical predictors.
newdata <- emmeans(quinn.nb, ~ DENSITY | SEASON, type = "response") %>% as.data.frame()
head(newdata)
ggplot(newdata, aes(y = response, x = SEASON, fill = DENSITY)) +
geom_pointrange(aes(ymin = asymp.LCL, ymax = asymp.UCL, shape = DENSITY),
position = position_dodge(width = 0.2)
) +
theme_classic() +
theme(
axis.title.x = element_blank(),
legend.position = c(0.01, 1), legend.justification = c(0, 1)
) +
annotate(geom = "text", x = "Summer", y = 70, label = "*", size = 7) +
scale_shape_manual(values = c(21, 22))
Unfortunately, DHARMa and performance do not work with zeroinf() models.
library(pscl)
library(glmmTMB)
quinn.zip <- zeroinfl(RECRUITS ~ DENSITY * SEASON | 1, data = quinn, dist = "poisson")
quinn.zip <- glmmTMB(RECRUITS ~ DENSITY * SEASON, zi = ~1, data = quinn, family = poisson())
quinn.resid <- quinn.zip %>% simulateResiduals(plot = TRUE)
## The following does not work due to a lack of pearson residuals for glmmTMB
## quinn.zip %>% performance::check_overdispersion()
quinn.zip %>% summary()
## Family: poisson ( log )
## Formula: RECRUITS ~ DENSITY * SEASON
## Zero inflation: ~1
## Data: quinn
##
## AIC BIC logLik deviance df.resid
## 320.6 336.2 -151.3 302.6 33
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3026 0.1291 17.836 < 2e-16 ***
## DENSITYLow 0.1133 0.1858 0.610 0.54191
## SEASONSummer 1.5721 0.1419 11.081 < 2e-16 ***
## SEASONAutumn 0.6763 0.1586 4.266 1.99e-05 ***
## SEASONWinter -0.5680 0.2147 -2.646 0.00815 **
## DENSITYLow:SEASONSummer -0.8970 0.2134 -4.202 2.64e-05 ***
## DENSITYLow:SEASONAutumn -0.1881 0.2381 -0.790 0.42957
## DENSITYLow:SEASONWinter 0.2165 0.4534 0.478 0.63300
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.0037 0.7305 -4.112 3.92e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# tidy(quinn.zip, conf.int=TRUE, exponentiate = TRUE)
exp(-3.0037)
## [1] 0.0496032
## quinn.zip1 <- zeroinfl(RECRUITS ~ DENSITY*SEASON | SEASON, data=quinn, dist='poisson')
quinn.zip1 <- glmmTMB(RECRUITS ~ DENSITY * SEASON, zi = ~SEASON, data = quinn, family = poisson())
quinn.resid <- quinn.zip1 %>% simulateResiduals(plot = TRUE)
quinn.zip1 %>% summary()
## Family: poisson ( log )
## Formula: RECRUITS ~ DENSITY * SEASON
## Zero inflation: ~SEASON
## Data: quinn
##
## AIC BIC logLik deviance df.resid
## 320.0 340.9 -148.0 296.0 30
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3026 0.1291 17.836 < 2e-16 ***
## DENSITYLow 0.1133 0.1858 0.610 0.54191
## SEASONSummer 1.5721 0.1419 11.081 < 2e-16 ***
## SEASONAutumn 0.6763 0.1586 4.266 1.99e-05 ***
## SEASONWinter -0.5680 0.2147 -2.646 0.00815 **
## DENSITYLow:SEASONSummer -0.8970 0.2134 -4.202 2.64e-05 ***
## DENSITYLow:SEASONAutumn -0.1881 0.2381 -0.790 0.42957
## DENSITYLow:SEASONWinter 0.2291 0.4375 0.524 0.60045
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -20.468 8393.486 -0.002 0.998
## SEASONSummer -3.275 42171.914 0.000 1.000
## SEASONAutumn -3.527 52021.072 0.000 1.000
## SEASONWinter 19.214 8393.486 0.002 0.998
plogis(-20.468)
## [1] 1.290805e-09
plogis(-20.468 + 19.214)
## [1] 0.2220085
## quinn.zinb <- zeroinfl(RECRUITS ~ DENSITY*SEASON | 1, data=quinn, dist='negbin')
quinn.zinb <- glmmTMB(RECRUITS ~ DENSITY * SEASON, zi = ~SEASON, data = quinn, family = nbinom2())
quinn.resid <- quinn.zinb %>% simulateResiduals(plot = TRUE)
AICc(quinn.zip, quinn.zip1, quinn.zinb)
quinn.zinb %>% summary()
## Family: nbinom2 ( log )
## Formula: RECRUITS ~ DENSITY * SEASON
## Zero inflation: ~SEASON
## Data: quinn
##
## AIC BIC logLik deviance df.resid
## 296.2 318.8 -135.1 270.2 29
##
##
## Dispersion parameter for nbinom2 family (): 11
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3026 0.1784 12.907 < 2e-16 ***
## DENSITYLow 0.1133 0.2605 0.435 0.66355
## SEASONSummer 1.5721 0.2246 7.000 2.56e-12 ***
## SEASONAutumn 0.6763 0.2355 2.872 0.00408 **
## SEASONWinter -0.5680 0.2764 -2.055 0.03988 *
## DENSITYLow:SEASONSummer -0.8970 0.3305 -2.714 0.00665 **
## DENSITYLow:SEASONAutumn -0.1881 0.3577 -0.526 0.59899
## DENSITYLow:SEASONWinter 0.2130 0.5887 0.362 0.71757
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -20.772 9771.312 -0.002 0.998
## SEASONSummer -6.012 189293.632 0.000 1.000
## SEASONAutumn -5.942 200192.420 0.000 1.000
## SEASONWinter 19.507 9771.312 0.002 0.998